Add video text to text docs #33164
Conversation
Caused by #31292, will work on it
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
amyeroberts
left a comment
LGTM - thanks for adding! ❤️
zucchini-nlp
left a comment
Yay, thanks for adding this! Looks good, but I was thinking of adding an inference example with pure VideoLLMs, WDYT?
> Now we can preprocess the inputs.
>
> This model has a prompt template that looks like the following. First we'll put all sampled frames into one list. Since we have eight frames in each video, we will insert 12 `<image>` tokens to our prompt. Note that we are adding `assistant` at the end to trigger the model to give answers. We can then preprocess.
Hmm, seems to be a typo, 8 frames per video but a total of 12?
Sorry, I later changed to the 7B model so I should've modified this.
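For context, the step under discussion looks roughly like this. It's only a sketch, assuming the smaller 0.5B checkpoint and dummy PIL frames standing in for the actually sampled video frames; the 6-frames-per-video / 12-token count is illustrative, not the guide's final numbers.

```python
# Rough sketch of the preprocessing step, with dummy frames as placeholders
# for the frames sampled from two videos.
import torch
from PIL import Image
from transformers import LlavaProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-0.5b-hf"
processor = LlavaProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder frames: 6 per video, 2 videos -> 12 frames in one list.
frames = [Image.new("RGB", (336, 336)) for _ in range(12)]

# One <image> token per sampled frame, then the question, and `assistant`
# at the end to trigger the model to answer.
user_prompt = "Are these two cats in these two videos doing the same thing?"
image_tokens = "<image>" * len(frames)
prompt = f"<|im_start|>user {image_tokens}\n{user_prompt}<|im_end|><|im_start|>assistant"

inputs = processor(text=prompt, images=frames, return_tensors="pt").to(model.device, model.dtype)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

The key point is that the number of `<image>` tokens in the prompt has to match the total number of frames passed to the processor.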
> - chat fine-tuned models for conversation
> - instruction fine-tuned models
>
> This guide focuses on inference with an instruction-tuned model, [llava-hf/llava-interleave-qwen-7b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-7b-hf), which can take in interleaved data. Alternatively, you can try [llava-interleave-qwen-0.5b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-0.5b-hf) if your hardware doesn't allow running a 7B model.
Maybe we could/should add an example with a pure VideoLLM, where we don't have to manually replicate the image token several times and where the model has special treatment for videos, like extra pooling layers.
llava-next-video or video-llava could be an option for that.
I thought most models are coming out as interleaved, so actually using an interleaved example is good since they're harder to get started with. I can add a simple VideoLLM example separately with chat templates though.
Yes, they are mostly interleaved. The difference with llava-interleave is that we didn't add a new model for it, so it's kind of an image LLM used for video. For all the others I am trying to make two separate processors, for images and for videos, with their own special tokens.
Okay, I'll add a video-only one and modify it when you make the processors, does that sound good?
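As a rough idea, a video-only, chat-template-driven example could look like the sketch below, where the chat template inserts the video token so there's no manual `<image>` replication. The LLaVA-NeXT-Video checkpoint and the placeholder frame array are assumptions for illustration, not what necessarily ends up in the guide.

```python
# Minimal sketch of a pure VideoLLM flow with chat templates; `video` would be
# a sampled clip as a (num_frames, height, width, 3) array.
import numpy as np
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]
# The chat template adds the video token for us.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

video = np.zeros((8, 336, 336, 3), dtype=np.uint8)  # placeholder for real sampled frames
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device, model.dtype)
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```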
@zucchini-nlp re: Slack discussions, I'd say we merge this and edit when the processors are out.
zucchini-nlp
left a comment
Yes, sounds good to me. We'll let users discover how each model expects the inputs from its model card, as there's no single standard yet and we don't natively support video-only LLMs.
Approved, thanks! 💛
Adding video-text-to-text task guide
@zucchini-nlp @NielsRogge @amyeroberts @stevhliu